Pioneering advanced security solutions for reinforcement learning-based adaptive key rotation in Zigbee networks

In the rapidly evolving landscape of Internet of Things (IoT), Zigbee networks have emerged as a critical component for enabling wireless communication in a variety of applications. Despite their widespread adoption, Zigbee networks face significant security challenges, particularly in key management and network resilience against cyber attacks like distributed denial of service (DDoS). Traditional key rotation strategies often fall short in dynamically adapting to the ever-changing network conditions, leading to vulnerabilities in network security and efficiency. To address these challenges, this paper proposes a novel approach by implementing a reinforcement learning (RL) model for adaptive key rotation in Zigbee networks. We developed and tested this model against traditional periodic, anomaly detection-based, heuristic-based, and static key rotation methods in a simulated Zigbee network environment. Our comprehensive evaluation over a 30-day period focused on key performance metrics such as network efficiency, response to DDoS attacks, network resilience under various simulated attacks, latency, and packet loss in fluctuating traffic conditions. The results indicate that the RL model significantly outperforms traditional methods, demonstrating improved network efficiency, higher intrusion detection rates, faster response times, and superior resource management. The study underscores the potential of using artificial intelligence (AI)-driven, adaptive strategies for enhancing network security in IoT environments, paving the way for more robust and intelligent Zigbee network security solutions.

In the realm of Internet of Things (IoT), Zigbee technology has emerged as a cornerstone for establishing reliable, low-power, and wireless communication networks. Predominantly used in applications ranging from home automation to industrial control systems, Zigbee's efficiency and flexibility make it a preferred choice in a myriad of IoT scenarios 1. However, the increasing dependency on Zigbee networks has escalated concerns regarding their security. With threats ranging from unauthorized access to data integrity breaches, the security of Zigbee networks is pivotal for the safe operation of IoT systems. Zigbee networks traditionally rely on standard security protocols that include symmetric key encryption and static key rotation methods. While these measures provide a fundamental level of security, they are increasingly inadequate against sophisticated cyber threats. Static key rotation schedules, although useful, lack the adaptability required in dynamic network environments due to their inherent predictability, lack of responsiveness to changing conditions, inefficiency in balancing security and performance, and scalability issues in diverse network segments. These limitations necessitate the development of adaptive key rotation methods that can dynamically respond to real-time security threats and optimize network performance.
In light of these challenges, there is a pressing need for security mechanisms that are not only robust but also agile and adaptive to evolving threats. Adaptive key rotation, where the cryptographic keys are changed dynamically based on real-time network conditions and threat levels, represents a promising solution. However, the

Data transmission security
WSNs in terms of latency and reliability, particularly in the railroad industry 23. Kulasekara et al. introduce a novel Zigbee-based smart anti-theft system for electric bikes, improving personal security and reducing power consumption 24. Lastly, Nourildean et al. review IoT-based WSNs, affirming Zigbee's role in facilitating low-power, low-cost communication in various IoT applications 25.
Recent research applies several advanced techniques to enhance the security of IoT systems. One study introduces a hybrid privacy-preserving federated learning framework that effectively protects against irregular users in next-generation IoT environments 26. Federated learning ensures data privacy while maintaining robust performance against adversarial attacks 27. A secure intelligent fuzzy blockchain framework enhances threat detection capabilities in IoT networks by integrating fuzzy logic with blockchain technology 28. Moreover, federated learning has been used for cyber threat hunting in blockchain-based industrial IoT networks. This method enhances the detection and mitigation of cyber threats by leveraging the strengths of both federated learning and blockchain technology 29.
These studies collectively underscore the evolving nature of security challenges in IoT and the diverse approaches being explored to address these challenges, ranging from enhanced encryption methods to innovative applications in various industrial and consumer contexts.
Zigbee networks, particularly in smart home systems, confront significant security challenges. With regard to general security challenges, the authors emphasize the difficulty in detecting, defending, and foreseeing vulnerabilities, suggesting the use of attack graph generation for security analysis 2. The decentralised nature of Zigbee ad-hoc networks presents unique security challenges, particularly in maintaining network security and intrusion detection 3. Meanwhile, the authors point out the trade-off between security and the goals of simplicity and low cost in Zigbee network technology, often leading to compromised security features 4 9. Ramsey et al. introduce a multi-factor Phy-MAC-NWK security framework, using RF Phy features to enhance bit-level security 10. In recent years, a number of emerging security technologies have been used to further address the above issues. Ren et al. demonstrate the effectiveness of Z-Fuzzer, a device-agnostic fuzzing tool, in detecting vulnerabilities in Zigbee protocol implementations 11. Fard et al. focus on rogue device discrimination in Zigbee networks using wavelet transform and autoencoders 12. Hussein et al. highlight the practicality of conventional attacks like MQTT-based DoS, MITM, and masquerade attacks in commercial home automation IoT devices, underscoring the need for improved security 13. Ruiz et al. discuss the challenges in designing heterogeneous wireless sensors for IoT, including power constraints, security, and quality of service parameters 14. Hong et al. address the security challenges in data transmission in the IoT, including Zigbee, such as label information interception and sensor network node DoS attacks 15.
The security of ZigBee networks has been extensively studied, with various inherent features and vulnerabilities identified, real and proof-of-concept (PoC) attacks documented, and numerous mitigation techniques proposed. Table 2 highlights the core aspects of ZigBee network security and summarizes various studies that have addressed these issues. By providing this comprehensive overview, we aim to situate our research within the broader context of existing work and underscore the significance of our proposed reinforcement learning-based adaptive key rotation strategy.
This study introduces a pioneering approach by integrating a Reinforcement Learning (RL) model into the Zigbee security framework. RL, a branch of machine learning, offers the ability to learn optimal behaviors through interactions with the environment 30. By employing an RL model, such as Q-learning or Deep Q-Networks (DQNs), for the decision-making process in key rotation, the proposed system aims to intelligently adapt its security measures in real time. Q-learning is a model-free reinforcement learning algorithm that aims to find the optimal action-selection policy by learning Q-values for each state-action pair. These Q-values represent the expected utility of taking a particular action in a given state. The algorithm updates its Q-values using the Bellman equation, iteratively improving its policy based on the rewards received. DQNs extend Q-learning by using deep neural networks to approximate the Q-value function. In our study, these techniques enable the development of an adaptive key rotation method that dynamically responds to the security state of the Zigbee network, improving both security and performance. This approach is expected to enhance the resilience of Zigbee networks against emerging threats while maintaining optimal network performance. The contributions of this paper can be summarized as: 1. The integration of RL in Zigbee network security is a novel venture, poised to set a new standard in adaptive security mechanisms. 2. This research aims to not only develop and implement the RL-based key rotation system but also to empirically evaluate its effectiveness in enhancing Zigbee network security. 3. Extensive experimental results are used to verify the superiority of the proposed scheme. The proposed schemes of this paper are anticipated to provide significant insights and a solid foundation for future advancements in IoT network security.
The structure of this paper is as follows. The proposed method is provided in the "Proposed method" section, followed by the simulation results in the "Experimental results" section. Finally, conclusions are drawn in the "Conclusion" section.

Proposed method
Confronting the dynamic challenges in Zigbee network security, particularly in key management and resilience against cyber threats such as DDoS attacks, this section introduces an innovative approach utilizing an RL model for adaptive key rotation.

Encryption and key management in Zigbee
Zigbee employs the Advanced Encryption Standard (AES)-128 for encryption. In the encryption function, $E_k(x)$ is the encrypted output, $x$ is the plaintext input, $k_i$ represents the key used in the $i$th round, and $\oplus$ denotes the XOR operation. Meanwhile, key rotation is essential for maintaining security. The periodic rotation can be modeled as
$$t_{\text{rotation}} = t_{\text{initial}} + n \cdot \Delta t,$$
where $t_{\text{rotation}}$ is the time for the next key rotation, $t_{\text{initial}}$ is the time of the initial key establishment, $n$ is the number of completed rotations, and $\Delta t$ is the set time interval.
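As an illustration of the fixed-interval schedule described above, the following minimal Python sketch computes a rotation time from the initial key establishment time, a rotation count, and a fixed interval. The function name `rotation_time` and the six-hour interval in the usage note are hypothetical choices for illustration and are not taken from the paper.

```python
from datetime import datetime, timedelta


def rotation_time(t_initial: datetime, n: int, interval: timedelta) -> datetime:
    """Return t_initial + n * interval for a fixed-interval rotation policy.

    t_initial: time of the initial key establishment.
    n:         rotation count, following the text's convention.
    interval:  fixed time between rotations (hypothetical value chosen by the caller).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    return t_initial + n * interval
```

For example, with keys established at midnight and an assumed six-hour interval, `rotation_time(datetime(2024, 1, 1), 3, timedelta(hours=6))` falls at 18:00 on the same day. Because the result depends only on the clock, an observer who knows the interval can predict every rotation, which is the predictability that adaptive rotation aims to remove.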

Reinforcement learning basics
RL is a machine learning method that learns by interacting with the environment. It attempts to learn a policy by maximizing the cumulative reward that reflects the effect of its actions in the environment. Q-learning is a specific RL algorithm. Each possible action in Q-learning has a corresponding Q-value, which represents the expected utility of taking that action in a specific state. The Q-learning update rule is given by
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right].$$
Here, $Q(s, a)$ is the current Q-value for a state $s$ and action $a$. The update is based on the immediate reward $r$ and the discounted maximum Q-value of the next state $s'$ over all possible actions $a'$; $\gamma$ is the discount factor (which balances immediate and future reward), and $\alpha$ is the learning rate (which determines to what extent the newly acquired information overrides the old information). The policy $\pi$ at any state $s$ can be derived from the Q-table as
$$\pi(s) = \arg\max_{a} Q(s, a),$$
where the policy at any state $s$ is the action $a$ that has the highest Q-value in state $s$. Under the policy $\pi$, the value function $V$ can be calculated by
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s \right].$$
The value function represents the expected cumulative reward starting from state $s$, following policy $\pi$, where $r_t$ is the reward at time $t$. Using the Bellman optimality equation, the above equation can be optimized as
$$V^{*}(s) = \max_{a} \mathbb{E}\left[ r + \gamma V^{*}(s') \,\middle|\, s, a \right].$$
The Bellman optimality equation provides the basis for finding the optimal value function $V^{*}(s)$. It states that the value of a state $s$ under an optimal policy is the maximum expected return achievable, taking into account the

Reinforcement learning in Zigbee key rotation
Using RL in Zigbee key rotation, the detailed parameters are defined as follows. The state space is $S = \{s_1, s_2, s_3, s_4\}$, where $s_1$ denotes the time elapsed since the last key rotation, $s_2$ is the number of detected unauthorized access attempts, $s_3$ represents the network traffic volume, and $s_4$ is the historical data of key rotation effectiveness. The action space $A$ consists of two primary actions, rotate key ($a_1$) and maintain current key ($a_2$). The policy $\pi$ is a function that maps states to actions. Using a softmax selection rule, the policy for state $s$ can be expressed as
$$\pi(a \mid s) = \frac{\exp\left(Q(s, a)/\tau\right)}{\sum_{a' \in A} \exp\left(Q(s, a')/\tau\right)},$$
where $\tau$ is the temperature parameter controlling the exploration-exploitation balance. Specifically, a high temperature $\tau$ promotes exploration by making the probability distribution over actions more uniform, encouraging the agent to try different actions and gather more information about the environment. A low temperature $\tau$ favors exploitation by concentrating the probability distribution on actions with higher estimated rewards, encouraging the agent to choose actions that have previously yielded high rewards.
In our method, $\tau$ is dynamically adjusted to achieve an optimal balance between exploration and exploitation. Initially, a higher $\tau$ is used to promote exploration. As the agent learns and gathers more information, $\tau$ is gradually decreased according to an annealing schedule. The annealing schedule can be linear, exponential, or based on other decay functions. We used an exponential decay schedule, i.e. $\tau = \tau_0 \times \exp(-\lambda \times t)$, where $\tau_0$ is the initial temperature, $\lambda$ is the decay rate, and $t$ is the time step. Finally, during training, $\tau$ is periodically adjusted based on the agent's performance. If the agent is not exploring enough (indicated by low variance in action selection), $\tau$ is temporarily increased to encourage more exploration.
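The softmax selection rule and the exponential annealing schedule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the lower bound `tau_min` on the temperature, and the numerical-stability shift are assumptions added for the example.

```python
import math
import random


def annealed_temperature(tau0: float, decay_rate: float, step: int,
                         tau_min: float = 1e-3) -> float:
    """Exponential schedule tau0 * exp(-decay_rate * step).

    tau_min is an assumed floor that keeps the softmax well defined
    late in training; it is not part of the schedule in the text.
    """
    return max(tau_min, tau0 * math.exp(-decay_rate * step))


def softmax_probabilities(q_values: list[float], tau: float) -> list[float]:
    """Softmax (Boltzmann) distribution over actions given their Q-values."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Subtracting the maximum leaves the probabilities unchanged
    # but avoids overflow in exp() when tau is small.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values: list[float], tau: float, rng: random.Random) -> int:
    """Sample an action index (0 = rotate key, 1 = maintain key) from the softmax."""
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With Q-values `[1.0, 0.0]`, a temperature of 100 gives probabilities close to one half for each action, while a temperature of 0.01 puts almost all of the probability on the first action, matching the exploration and exploitation regimes described in the text.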
The reward function $R(s, a)$ is designed to capture the immediate and long-term consequences of actions. It is a weighted combination of components for security, performance, and operational costs, where $w_1$, $w_2$, and $w_3$ are weights indicating the importance of each component. A higher weight $w_1$ is assigned to the security component to emphasize its importance in maintaining network integrity and protecting against attacks. A moderate weight $w_2$ is assigned to the performance component to ensure that the key rotation method does not degrade overall network efficiency. A lower weight $w_3$ is assigned to the cost component to ensure cost efficiency while not compromising security and performance.
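A weighted reward of this kind might be sketched as follows. The weight values (0.6, 0.3, 0.1) are hypothetical and only respect the ordering stated above. Treating the cost component as a penalty that is subtracted, and assuming that each component is already normalized to the range [0, 1], are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights with w1 > w2 > w3, as the text prescribes."""
    security: float = 0.6     # w1: highest priority
    performance: float = 0.3  # w2: moderate priority
    cost: float = 0.1         # w3: lowest priority


def reward(security: float, performance: float, cost: float,
           weights: RewardWeights = RewardWeights()) -> float:
    """Weighted reward for one (state, action) pair.

    Assumption: security and performance contribute positively,
    while operational cost is subtracted as a penalty.
    """
    return (weights.security * security
            + weights.performance * performance
            - weights.cost * cost)
```

Under these assumed weights, a rotation that scores 1.0 on security, 0.5 on performance, and 0.2 on cost yields a reward of 0.73, so a rotation that improves security is favored even when it costs some performance.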
Our objective is to find an optimal policy $\pi^{*}$ that maximizes the expected cumulative reward. The optimization problem can be formulated as
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right],$$
subject to operational constraints such as latency and resource usage.

Reward function and policy optimization
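The Q-learning update formula used for key rotation decisions in Zigbee networks is given by
$$Q(s, a) \leftarrow Q(s, a) + \alpha_t \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$
where $\alpha_t$ is the time-dependent learning rate, which decays over time from the initial learning rate $\alpha_0$ at a rate determined by the decay rate parameter.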
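The Q-learning update with a time-dependent learning rate can be sketched as a tabular update in Python. This is an illustrative sketch under stated assumptions: states are taken to be discretized into hashable labels, the action names are placeholders for rotate and maintain, and the inverse-time decay used in `learning_rate` is an assumed form, since the exact decay function is not given here.

```python
from collections import defaultdict

# Placeholder names for the two actions a1 (rotate) and a2 (maintain).
ACTIONS = ("rotate_key", "maintain_key")


def learning_rate(alpha0: float, decay_rate: float, step: int) -> float:
    """Assumed inverse-time decay: alpha0 / (1 + decay_rate * step)."""
    return alpha0 / (1.0 + decay_rate * step)


def q_update(q_table, state, action, reward, next_state,
             alpha: float, gamma: float) -> float:
    """Apply one tabular Q-learning step and return the new Q(state, action).

    q_table maps (state, action) pairs to Q-values; unseen pairs default to 0.
    """
    # Greedy estimate of the value of the next state.
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return q_table[(state, action)]


def make_q_table():
    """Create an empty Q-table with zero-initialized entries."""
    return defaultdict(float)
```

Starting from an empty table, a reward of 1.0 with a learning rate of 0.5 moves the Q-value of the visited pair from 0 to 0.5. Later updates propagate that value to predecessor states through the discounted maximum term.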
A more advanced model for the security component can be represented by a weighted sum of various security metrics,
$$\text{Security}(s, a) = \sum_{i} w_i \, m_i(s, a),$$
where $m_i(s, a)$ represents different security metrics and $w_i$ denotes their respective weights.
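The weighted sum of security metrics can be illustrated with a short Python function. The metric names used in the usage note (`detection_rate`, `key_freshness`) and their weights are hypothetical; the text does not list the individual metrics.

```python
def security_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of security metrics m_i with weights w_i.

    metrics: metric name -> value for the current (state, action) pair.
    weights: metric name -> weight; every metric must have a weight.
    """
    missing = metrics.keys() - weights.keys()
    if missing:
        raise KeyError(f"no weight defined for metrics: {sorted(missing)}")
    return sum(weights[name] * value for name, value in metrics.items())
```

For instance, with assumed weights of 0.7 for `detection_rate` and 0.3 for `key_freshness`, metric values of 0.9 and 0.5 combine to a score of 0.78.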

Evaluation criteria
Key performance indicators (KPIs) are monitored via the built-in analytics tools of Network Simulator 3 (NS3), providing real-time data on network performance. The baseline for the traditional key rotation method is shown in Table 3.
In this paper, we employ statistical techniques such as hypothesis testing and confidence interval analysis to assess the significance of the observed differences, and we explore real data simulations using a comparative analysis formula to quantify improvements. Moreover, NS3 is configured to simulate real-world conditions with parameters like signal strength, interference, and packet loss. Network scenarios include everyday usage, peak load times, and attack simulations like DoS and spoofing. The state space configuration is shown in Table 5.
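To make the statistical comparison concrete, the following Python sketch shows one way such an analysis could be carried out with the standard library. It is an assumption-laden illustration: the relative-improvement formula, the choice of Welch's t statistic, and the normal-approximation confidence interval are stand-ins chosen for the example, since the specific test and formula are not stated here. Any sample values passed to these functions are the caller's own.

```python
import math
from statistics import NormalDist, mean, variance


def relative_improvement(baseline: float, proposed: float) -> float:
    """Percentage change of the proposed value relative to the baseline (assumed formula)."""
    return (proposed - baseline) / baseline * 100.0


def _standard_error(a: list[float], b: list[float]) -> float:
    """Standard error of the difference in means, without assuming equal variances."""
    return math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def welch_t_statistic(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for the difference in means of two samples."""
    return (mean(a) - mean(b)) / _standard_error(a, b)


def mean_difference_ci(a: list[float], b: list[float],
                       confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for mean(a) - mean(b).

    Uses a normal approximation for the critical value, which is only
    a rough guide for small samples.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    diff = mean(a) - mean(b)
    half_width = z * _standard_error(a, b)
    return diff - half_width, diff + half_width
```

A confidence interval for the mean difference that excludes zero, together with a large t statistic, would support the claim that the observed difference between two methods is unlikely to be due to chance.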

Results
The proposed method, designed to outperform traditional key rotation strategies, was rigorously tested in a simulated Zigbee environment, focusing on critical performance metrics over a 30-day period. These metrics included network efficiency, response to DDoS attacks, resilience under varied attacks, and traffic condition adaptability. Our results demonstrate the superiority of the RL model over conventional methods, marking a substantial advancement in Zigbee network security by offering a more dynamic, intelligent, and efficient solution.
Figure 1 illustrates the dynamic performance of different key rotation strategies. The RL model, represented by the blue line, shows a consistent upward trend in efficiency, evidencing its strong adaptability. In contrast, the traditional periodic rotation (red line) exhibits fluctuations, suggesting variability in its performance. The anomaly detection-based rotation (green line) and heuristic-based rotation (purple line) demonstrate moderate performance with some variability, but neither matches the steady improvement of the RL model. This can be attributed to the fact that the proposed RL model not only enhances the overall network throughput but also ensures more consistent performance across various scenarios, surpassing traditional strategies in both reliability and efficiency.
Figure 2 provides a performance comparison of different methods during a DDoS attack. The RL model excels with a 92% intrusion detection rate and an 18-second response time, showcasing its effectiveness in handling cyber threats efficiently. The traditional method performs worse in both detection rate and response time, highlighting potential vulnerabilities. The anomaly detection and heuristic-based rotations show balanced performances but are not as optimal as the RL model. The superior performance of the proposed scheme can be attributed to the adaptive nature of our RL-based system, which learns and evolves to recognize and respond to new threats more effectively than static, traditional systems. Figure 3 offers insights into each method's performance under varying traffic conditions. The RL model stands out for maintaining the lowest latency and packet loss, indicating its efficiency and adaptability. The traditional method struggles under both low and high traffic conditions, while the traffic volume-based and predictive analysis-based rotations show moderate performance levels.
Figure 4 displays the varying efficiencies of different key rotation methods. The adaptive key rotation efficiency refers to the effectiveness and performance of a system's key rotation mechanism when dynamically adapting to changing network conditions and security threats. The RL model consistently outperforms other models, maintaining high efficiency throughout the period. The traditional method, while more variable, shows lower efficiency, and the adaptive method, though better than the traditional, still does not reach the RL model's levels. The static method lags significantly, reflecting a lack of adaptability and efficiency.
In Fig. 5, we compare the resilience of different key management strategies under various attack types. The network resilience score is a metric that quantifies the robustness and stability of a network in maintaining its performance and functionality despite facing various types of cyber-attacks and adverse conditions. The RL model scores impressively across all scenarios, affirming its robustness against diverse cyber threats. In contrast, the traditional and adaptive methods exhibit moderate resilience, with more significant variances in performance. The static method consistently ranks lowest, underscoring the need for more dynamic strategies.
Figure 6 compares the resource utilization of each method. The RL model shows the most efficient resource management, particularly under high-stress conditions. The traditional and adaptive methods consume more resources, with the traditional method being slightly less efficient. The static method's high resource utilization, especially under stress, indicates inefficiencies and scalability challenges.
www.nature.com/scientificreports/

Limitations and challenges of RL-based adaptive key rotation
While the RL-based adaptive key rotation method offers significant improvements in adapting to dynamic network conditions, it is important to recognize and address its inherent limitations and challenges.
Susceptibility to dynamic adversarial attacks
RL models, despite their adaptive capabilities, can be vulnerable to dynamic adversarial attacks. These attacks involve an adversary that continuously changes its strategy to mislead the RL agent, potentially causing suboptimal or insecure key rotation decisions. To mitigate this risk, our future work will explore robust RL techniques such as adversarial training and the integration of anomaly detection mechanisms that can identify and counteract adversarial behaviors.

Computational overhead
The implementation of RL models in real IoT environments poses significant computational challenges. The continuous learning and adaptation process requires considerable processing power and memory, which can strain the limited resources of IoT devices. To address this, we propose the development of lightweight RL algorithms optimized for IoT devices, and the use of edge computing to offload intensive computations from individual devices to more capable edge servers.

Scalability issues
Scalability is another critical factor, as the RL model needs to manage key rotation across a potentially large number of devices in a Zigbee network. We plan to investigate hierarchical RL approaches that distribute the computational load and allow for efficient management of key rotation at different network levels. By acknowledging these limitations and outlining potential solutions, we aim to provide a more comprehensive understanding of the feasibility and applicability of RL-based adaptive key rotation in Zigbee networks.

Conclusion
In this paper, we explored the innovative application of RL for enhancing security in Zigbee networks through adaptive key rotation strategies. We have identified the unique challenges Zigbee networks face, particularly in key management and resilience against network threats like DDoS attacks. Our proposed RL-based approach dynamically adjusts key rotation policies, demonstrating significant improvements over traditional methods in intrusion detection rates, response times, and resource management. Our experimental findings underscore the effectiveness of RL in adapting to varying network conditions, offering a robust solution to maintaining network integrity and security. By continuously learning from the network environment, our approach efficiently balances security needs with operational performance.
In the future, there are promising avenues for further research. Enhancing the RL model for even more nuanced decision-making and extending this methodology to a broader range of network security scenarios could yield substantial benefits. The potential of RL in cybersecurity is immense, particularly in its ability to adapt and respond to evolving threats.

Figure 2. Performance comparison of various methods under DDoS attack.

Figure 3. Network latency and packet loss during fluctuating traffic.

Figure 6. Resource utilization comparison under various conditions.

Table 1. Comparison of methods used in Zigbee network security studies.
Stelte et al. note that non-hardened Zigbee networks are more susceptible to attacks like simple association flooding and packet replay attacks 5. To address security threats in Zigbee networks, Misra et al. propose a PKI-enabled secure communication framework for Zigbee sensor networks, addressing limitations in memory and power consumption while introducing only a marginal increase in latency 6. Liu et al. discuss the challenges in tracking dynamic encryption key updates due to Zigbee communication's inherent retransmission and packet loss 7. Lee et al. propose a challenge-response approach to mitigate Sybil attacks in Zigbee networks 8. Patel et al. improve Zigbee device network authentication using ensemble classifiers, addressing security challenges in decentralized networks

Table 2. Summary of ZigBee features, vulnerabilities, attacks, and mitigation techniques.
deep neural networks to approximate the Q-value function, making it feasible to handle large state spaces. DQNs employ techniques such as experience replay and fixed Q-targets to stabilize and enhance the training process.

Table 5. State space definition.